Efficient, Language-based Checkpointing for Massively Parallel Programs
Abstract
Checkpointing and restart is an approach to ensuring forward progress of a program in spite of system failures or planned interruptions. We investigate issues in checkpointing and restart of programs running on massively parallel computers. We identify a new set of issues that have to be considered for the MPP platform, and on that basis we have designed an approach built on the language and run-time system. Hence our checkpointing facility can be used on virtually any parallel machine in a portable manner, irrespective of whether the operating system supports checkpointing. We present methods to make checkpointing and restart space- and time-efficient, including object-specific functions that save the state of an object. We present techniques to automatically generate checkpointing code for parallel objects, without programmer intervention. We also present mechanisms that allow the programmer to selectively incorporate application-specific knowledge to make checkpointing more efficient. The techniques developed here have been implemented in the Charm++ parallel object-oriented programming language and run-time system. Performance results are presented for the checkpointing overhead of programs running on parallel machines.
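The idea of an object-specific function that saves the state of an object can be illustrated with a minimal C++ sketch. This is not the Charm++ interface and not the code the paper generates. The class `GridBlock`, its members, and the method names `checkpoint` and `restart` are hypothetical and chosen only for illustration. The sketch assumes that part of an object's state (here a scratch buffer) can be recomputed after restart, so only the essential fields are written, which is one way a checkpoint might be kept small.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <utility>
#include <vector>

// Hypothetical parallel object: one block of a 1-D grid in an iterative solver.
class GridBlock {
public:
    GridBlock() = default;
    GridBlock(int iteration, std::vector<double> cells)
        : iteration_(iteration), cells_(std::move(cells)) {
        rebuildScratch();
    }

    // Object-specific save: write only the state needed to resume.
    // The scratch buffer is derived data, so it is deliberately skipped.
    void checkpoint(std::ostream& out) const {
        const std::uint64_t n = cells_.size();
        out.write(reinterpret_cast<const char*>(&iteration_), sizeof iteration_);
        out.write(reinterpret_cast<const char*>(&n), sizeof n);
        out.write(reinterpret_cast<const char*>(cells_.data()),
                  static_cast<std::streamsize>(n * sizeof(double)));
    }

    // Object-specific restore: read the saved fields, then recompute the rest.
    // Returns false if the stream does not hold a complete checkpoint.
    bool restart(std::istream& in) {
        std::uint64_t n = 0;
        in.read(reinterpret_cast<char*>(&iteration_), sizeof iteration_);
        in.read(reinterpret_cast<char*>(&n), sizeof n);
        if (!in) return false;
        cells_.assign(static_cast<std::size_t>(n), 0.0);
        in.read(reinterpret_cast<char*>(cells_.data()),
                static_cast<std::streamsize>(n * sizeof(double)));
        if (!in) return false;
        rebuildScratch();
        return true;
    }

    int iteration() const { return iteration_; }
    const std::vector<double>& cells() const { return cells_; }
    const std::vector<double>& scratch() const { return scratch_; }

private:
    // Recompute derived data from the essential state (here: squares of cells).
    void rebuildScratch() {
        scratch_.assign(cells_.size(), 0.0);
        for (std::size_t i = 0; i < cells_.size(); ++i) {
            scratch_[i] = cells_[i] * cells_[i];
        }
        assert(scratch_.size() == cells_.size());
    }

    int iteration_ = 0;            // essential: saved
    std::vector<double> cells_;    // essential: saved
    std::vector<double> scratch_;  // derived: not saved, rebuilt on restart
};

// Save an object into an in-memory buffer and restore it into a fresh one.
bool roundTripDemo() {
    GridBlock original(42, {1.0, 2.0, 3.0});
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    original.checkpoint(buffer);

    GridBlock restored;
    if (!restored.restart(buffer)) return false;
    return restored.iteration() == original.iteration() &&
           restored.cells() == original.cells() &&
           restored.scratch() == original.scratch();
}
```

In this sketch the checkpoint holds the iteration counter, the element count, and the cell values, while the scratch buffer costs no space on disk. The raw binary layout is tied to one machine architecture. A portable facility of the kind the abstract describes would need an architecture-neutral encoding, which the sketch does not attempt.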
Similar resources
ELMO: extending (sequential) languages with migratable objects-compiler support
Efficient task migration is an important feature in parallel and distributed programs, in particular to support checkpointing and recovery for fault tolerance. It is also very useful in distributed environments like networks of workstations where external loads are often unpredictable and dynamic in nature. We propose simple language extensions (ELMO) to existing sequential programming languages ...
An Object-oriented Implementation Model for the Promoter Language Technical Report
The PROMOTER programming language is designated for data parallel applications that are to run on massively parallel computers with distributed memory. This paper presents an object-oriented implementation model for the PROMOTER language. An object-oriented approach to compile data-parallel programs to message passing programs can reduce design complexity, facilitate reuse of components, and ea...
Project Triton: towards Improved Programmability of Parallel Computers Compilation Techniques. Triton/1 Parallel Architecture
This paper appeared in: The main objective of Project Triton is adequate programmability of massively parallel computers. This goal can be achieved by tightly coupling the design of programming languages and parallel hardware. The approach taken in Project Triton is to let high-level, machine-independent parallel programming languages drive the design of parallel hardware. This approach perm...
Transformation Based Development of Efficient Programs for Massively Parallel Architectures
This paper presents a methodology that is used to detect predefined algorithmic structures (skeletons) in a fine-granular program specification. For each skeleton the best mapping on a particular massively parallel system is known. The skeleton identification process helps in making good mapping decisions.
Automatic Parallel Program Checkpointing in Message-Passing Environments
Efficient use of cluster resources is very important because of the high demand for parallel computation. Checkpointing allows cluster computing time to be managed more efficiently. In this article, the problems of checkpointing parallel programs are discussed and an implementation of an automatic parallel checkpointing system for MPI programs is presented. It is based on simple user-space portable chec...
Publication date: 2007